Add CallWithTimeout and related. #331

stevelinton · 2015-11-02T12:51:51Z

This PR adds a GAP level function

CALL_WITH_TIMEOUT( <seconds>, <useconds>, <fn>, <args> )

<seconds> and <useconds> must be small ints

The return value is

fail -- ran out of time
[] -- returned within the timeout, no result
[ <result> ] -- returned a result within the timeout

The timer is paused during a break loop (and might even by nestable) and reset on quitting from a break.
There are some support functions:

TIMEOUTS_SUPPORTED(); -- returns true if timeouts could be compiled in this system and false otherwise

STOP_TIMEOUT(); -- stops the timer and returns a length 2 list [<seconds>, <useconds>]
RESUME_TIMEOUT( <state> ); -- takes such a list and sets the timer to it and starts it.

these are used in the break loop handling.

This is all intended to be wrapped up at GAP level into something a little bit nicer. There are two implementations behind the scenes, depending on your OS -- one using setitimer and a nicer one using the newer POSIX interface -- timer-create() etc. The second of these could be made to work sensibly in HPC-GAP, I think, but is not available on a Mac.

I'd welcome testing.

ChrisJefferson · 2015-11-02T13:51:41Z

Some minor issues:

Passing 0,0 as the seconds/microseconds makes the loop go forever.

The following code segfaults gap:

gap> f := function() with true do od; end;;
gap> CALL_WITH_TIMEOUT(10,0,f,[]);
<press ctrl+c>
brk> CALL_WITH_TIMEOUT(10,0,f,[]);
<press ctrl+c again>
brk_2>
<press ctrl+d>
brk> return;
'gap' terminated by signal SIGABRT

stevelinton · 2015-11-02T15:16:47Z

Setting the kernel timer to 0 is how you unset it, which explains the first one. This is probably best dealt with in the GAP wrapper function.

I can also reproduce the crash on OS X. It would be useful to know if anyone sees it on Linux.

Also does any of this work under cygwin at all?

stevelinton · 2015-11-03T15:02:27Z

Chris's crash is messy. I'm not yet 100% sure of the details, but essentially when the user quits from the inner break loop, the longjump buffer used for the timeout is left with its value from the inner timeout, which is invalid since it comes from stack frame that we have left (via the separate longjump used for error handling) so when the outer timeout fires, we try to longjump to a place that no longer exists, and segfault.

The only really nice way to fix this is a full review of the whole error/exception handling concept, and the related business of saving and restoring reader and execution state. Ultimately this would be nice but it will be quite hard and most likely break a few things along the way.

In the meantime, we either add another piece of sticking plaster by trying to find all the right places to save and restore the timeout jump buffer, or we avoid the problem by banning nested timeouts, or something like that.

stevelinton · 2015-11-03T22:32:05Z

Have added sticking plaster. Not quite as complex as I thought.

stevelinton · 2015-11-04T11:16:01Z

As far as I know, this is now ready for merge, although more testing is always good. It has documentation and a basic testfile.

ChrisJefferson · 2015-11-04T12:00:08Z

Can make this code crash about 1 time in 3 with the following nasty invocation:

List([1..1000], x -> CALL_WITH_TIMEOUT(0, x, Intersection,[Group((1,2)),Group((2,3))]));

Looks like an incorrect longjmp is occuring (there is no backtrace, just longjmping into nowhere).

stevelinton · 2015-11-04T13:07:37Z

@ChrisJefferson I can't reproduce this.

ChrisJefferson · 2015-11-04T13:15:43Z

Did you try popping it in a while do .. od? I'll see if I can track down the issue myself (tried make distclean, still seems to pop up). Am on Mac (I assume you are too).

stevelinton · 2015-11-04T13:19:53Z

I can reproduce it now, although rarely. I'll have a look at the code. There might be a race condition around setting up the jump buffer.

ChrisJefferson · 2015-11-04T13:26:21Z

The problem is that a we try calling with 'NumAlarmJumpBuffers' = 0, which I think should never happen. This suggests an alarm is getting called after we disabled the alarms (might be something written somewhere that that can happen.

stevelinton · 2015-11-04T13:27:51Z

Think I know what the problem is. If the signal comes in after we execute the last statement of the function, we are left with InterruptExecStats having been called and SyAlarmHasGoneOff set to true,
but nowhere to longjump to.

I can fix.

ChrisJefferson · 2015-11-04T13:41:03Z

Not sure what is going on with travis's timing -- seems to be going crazy but not sure how it's going crazy.

stevelinton · 2015-11-04T13:59:45Z

OK. I think I've fixed the race condition. Travis timings probably relate to which timers see the system CPU time associated with calls to RunTime(). Possibly need to rethink how to test this code.

ChrisJefferson · 2015-11-04T16:41:44Z

I tried injecting some 'Runtimes()' calls into the test to see where the time is being spent, it looks like there isn't any great amount of time being spent anywhere:

########> Diff in /home/travis/build/ChrisJefferson/gap/tst/testinstall/timeou\
t.tst, line 18:
# Input is:
Runtimes();
# Expected output:
# But found:
rec( system_time := 299, system_time_children := 0, user_time := 2523, 
  user_time_children := 0 )
########
########> Diff in /home/travis/build/ChrisJefferson/gap/tst/testinstall/timeou\
t.tst, line 19:
# Input is:
CallWithTimeout(5000,spinFor,1,1);
# Expected output:
[ 1 ]
# But found:
fail
########
########> Diff in /home/travis/build/ChrisJefferson/gap/tst/testinstall/timeou\
t.tst, line 21:
# Input is:
Runtimes();
# Expected output:
# But found:
rec( system_time := 307, system_time_children := 0, user_time := 2523, 
  user_time_children := 0 )
########

stevelinton · 2015-11-04T16:47:56Z

That’s 8ms of system time, which is 8000 microseconds, so the alarm might go off if that is being counted.

Steve

On 4 Nov 2015, at 16:41, Christopher Jefferson [email protected] wrote:

I tried injecting some 'Runtimes()' calls into the test to see where the time is being spent, it looks like there isn't any great amount of time being spent anywhere:

########> Diff in /home/travis/build/ChrisJefferson/gap/tst/testinstall/timeou
t.tst, line 18:

Input is:

Runtimes();

Expected output:

But found:

rec( system_time := 299, system_time_children := 0, user_time := 2523,
user_time_children := 0 )
########
########> Diff in /home/travis/build/ChrisJefferson/gap/tst/testinstall/timeou
t.tst, line 19:

Input is:

CallWithTimeout(5000,spinFor,1,1);

Expected output:

[ 1 ]

But found:

fail
########
########> Diff in /home/travis/build/ChrisJefferson/gap/tst/testinstall/timeou
t.tst, line 21:

Input is:

Runtimes();

Expected output:

But found:

rec( system_time := 307, system_time_children := 0, user_time := 2523,
user_time_children := 0 )
########

—
Reply to this email directly or view it on GitHub.

ChrisJefferson · 2015-11-04T16:53:34Z

Woops, I got my units wrong, you are right.

So just change spinFor to something like this?

gap> spinFor := function(ms, arg) local t; t := Runtime();
Syntax warning: New syntax used -- intentional? in stream line 1
spinFor := function(ms, arg) local t; t := Runtime();
                                      ^
> t := Runtimes().user_time + Runtimes().system_time;
> while Runtimes().user_time + Runtimes().system_time < t + ms do od;
> if Length(arg) >= 1
> then return arg[1]; else return; fi; end;
function( ms, arg ) ... end

rbehrends · 2015-11-05T17:39:44Z

I am wondering whether it would be a good idea to add SA_RESTART to the signal flags, since code that uses CallWithTimeOut() may not be prepared to deal with errors resulting from an interrupted syscall.

I also think there is a potential race condition (including in the existing code for handling SIGINT). Remember that the compiler is unaware of signals and can reorder assignments to RealExecStatCopied. This can be resolved (I think, haven't fully verified it) by making RealExecStatCopied a volatile sig_atomic_t variable (this may require testing for the availability of sig_atomic_t and falling back to int or char instead where unavailable).

stevelinton · 2015-11-05T20:37:59Z

Thanks.
While you’re here, am I right in thinking that the timer_create based implementation should be fairly easily adaptable to HPCGAP? I was thinking you could create one timer per thread, with the options that count thread user + system CPU and the signals ought to be delivered correctly. The setitimer version doesn’t look so adaptable, which is annoying for OSX users.

Steve

On 5 Nov 2015, at 17:39, Reimer Behrends [email protected] wrote:

I am wondering whether it would be a good idea to add SA_RESTART to the signal flags, since code that uses CallWithTimeOut() may not be prepared to deal with errors resulting from an interrupted syscall.

I also think there is a potential race condition (including in the existing code for handling SIGINT). Remember that the compiler is unaware of signals and can reorder assignments to RealExecStatCopied. This can be resolved (I think, haven't fully verified it) by making RealExecStatCopied a volatile sig_atomic_t variable (this may require testing for the availability of sig_atomic_t and falling back to int or char instead where unavailable).

—
Reply to this email directly or view it on GitHub.

rbehrends · 2015-11-05T22:54:00Z

Setting up signal delivery and the timing mechanism should work the same way as in the single-threaded version (assuming that per-thread CPU accounting is actually correct and not faked [1]); however, the actual signal delivery and handling is complicated when you have multiple threads (note that this is also the case, if, say, a GAP package starts a new thread [2]). The underlying problem is that POSIX doesn't guarantee that a signal is delivered to a specific thread, but allows the OS to pick any thread (even randomly) that doesn't block the signal.

The usual solution is to block the signal in all but a dedicated signal handler thread and have that signal be forwarded by that thread to the thread it was intended for via pthread_kill() or a dedicated interrupt delivery mechanism. As it happens, HPC-GAP already has both a signal handler thread and an interrupt delivery mechanism (though pthread_kill() should also work for this use case and result in a smaller diff). Look for SIGWAIT (and attendant mechanisms) in lib/thread.g and src/threadapi.c; InterruptThread() in src/thread.c is a general means to deliver a GAP-level interrupt to another thread (it's how SIGINT is currently handled).

One final caveat: The current implementation of CALL_WITH_TIMEOUT() seems to jump past CALL_WITH_CATCH() setjmps (correct me if I'm wrong). This may be intended, but in HPC-GAP, error handling there also pops entries off the lock stack. A longjmp would therefore have to copy the lock stack handling semantics (trivial, but it has to be done).

[1] This can be simulated, in the same way that profilers do sample-based profiling, but that may become hairy.

[2] I.e. if a GAP package did this, it would have to block signals that the main thread is prepared to handle; this is not a problem in HPC-GAP, which blocks the signals it handles itself in the main thread and worker threads, and new threads inherit the signal mask of the creating thread.

stevelinton · 2015-11-06T07:42:52Z

OK Thanks for the information.

olexandr-konovalov · 2015-11-06T17:49:34Z

@stevelinton tested on Windows VM, compiling and running GAP in Cygwin shell. It builds without warnings and errors, but then timeout.tst throws a bunch of Error, Could not set interval timer errors like here:

########> Diff in /home/jenkins/gap/tst/testinstall/timeout.tst, line 14:
# Input is:
CallWithTimeout(5000,spinFor,1);
# Expected output:
[  ]
# But found:
Error, Could not set interval timer
########

stevelinton · 2015-11-06T18:04:47Z

@alex-konovalov Great. So cygwin provides either setiimer or timer-create, but then it doesn't work.

Can you get the compile log from that system?

Even better, can you try to generate this error interactively and tell me what LastSystemError shows?

Thanks

ChrisJefferson · 2015-11-06T20:29:58Z

Thanks @alex-konovalov , quick google reveals cygwin doesn't do ITIMER_VIRTUAL, we'll have to use ITIMER_REAL on cygwin (which does measure wall time rather than process time, but should be a reasonable substitute)

olexandr-konovalov · 2015-11-06T20:51:48Z

Well, just replacing ITIMER_VIRTUAL by ITIMER_REAL in two places where it's used doesn't help...

ChrisJefferson · 2015-11-06T20:58:29Z

Still error in the same place?

olexandr-konovalov · 2015-11-06T21:09:23Z

@ChrisJefferson: yes :( I gave you the wrong fragment of code. It happens here:

  if (timer_settime(syTimer, 0, &tv, NULL)) {
    signal(TIMER_SIGNAL, SIG_DFL);
    ErrorReturnVoid("Could not set interval timer", 0L, 0L, "you can return to ignore");
  }

this is the version being used when #if HAVE_TIMER_CREATE && HAVE_SIGACTION

olexandr-konovalov · 2015-11-06T21:12:19Z

@stevelinton The No such file or directory. seems to be meaningful. I've set the breakpoint at timer_settime and that's what I observe:

gap> CallWithTimeout(5000,spinFor,1);

Breakpoint 1, timer_settime (timerid=0, flags=0, value=0x28c0a0, ovalue=0x0)
    at /usr/src/debug/cygwin-1.7.35-1/winsup/cygwin/timer.cc:338
338     {
(gdb) n
341       __try
(gdb) n
338     {
(gdb) n
341       __try
(gdb) n
344           if (tt->magic != TT_MAGIC)
(gdb) n
341       __try
(gdb) n
338     {
(gdb) n
341       __try
(gdb) n
__sjfault () at sigfe.s:233
233     sigfe.s: No such file or directory.

stevelinton · 2015-11-07T07:10:49Z

They must be using it to mean “no such timer”. I would guess that the timer-create call in InitAlarms, the return value of which, mea culpa, is not being checked, must be failing.

I imagine it is more-or-less the same as what Chris reported — CPU timing isn’t supported, so we have to use CLOCK_REALTIME instead of one of the others.

Steve

On 6 Nov 2015, at 21:12, Alexander Konovalov [email protected] wrote:

@stevelinton The No such file or directory. seems to be meaningful. I've set the breakpoint at timer_settime and that's what I observe:

gap> CallWithTimeout(5000,spinFor,1);

Breakpoint 1, timer_settime (timerid=0, flags=0, value=0x28c0a0, ovalue=0x0)
at /usr/src/debug/cygwin-1.7.35-1/winsup/cygwin/timer.cc:338
338 {
(gdb) n
341 __try
(gdb) n
338 {
(gdb) n
341 __try
(gdb) n
344 if (tt->magic != TT_MAGIC)
(gdb) n
341 __try
(gdb) n
338 {
(gdb) n
341 __try
(gdb) n
__sjfault () at sigfe.s:233
233 sigfe.s: No such file or directory.

—
Reply to this email directly or view it on GitHub.

ChrisJefferson · 2015-11-07T09:39:58Z

My windows install is currently broken -- once I fix it, I'll figure out exactly what does and doesn't work in cygwin.

stevelinton · 2015-11-13T16:08:55Z

The current version should in theory work on cygwin.If it does, then as far as I know it's ready to merge, although it might be sensible to squash the commits a bit.

ChrisJefferson · 2015-11-13T16:36:45Z

Yep, works fine with cygwin

olexandr-konovalov · 2015-11-13T20:47:35Z

@stevelinton @ChrisJefferson thanks - that was the only issue remaining, so let's merge this then.

Modify test file to hopefully please travis Add warning to documentation that a timeout may leave things inconsistent. Some robsutness improvements suggested by Reimer Check return value from timer_create

Make alarms work with TakeInterrupt() This allows long, but cooperative kernel functions to be interrupted.

Add CallWithTimeout and related.

olexandr-konovalov · 2015-11-17T21:05:43Z

No further comments for 4 days - merged now. Thanks

olexandr-konovalov · 2015-11-19T01:21:15Z

@stevelinton interestingly, the test passes on Travis infrastructure, but fails on all three machines that we use for Jenkins nightly tests:

########> Diff in /mnt/disk2/hudson-slave/workspace/GAP-dev/GAPCOPTS/64build/G\
APGMP/gmp/GAPTARGET/install/label/64bit/GAP-git-snapshot/tst/testinstall/timeo\
ut.tst, line 14:
# Input is:
CallWithTimeout(5000,spinFor,1);
# Expected output:
[  ]
# But found:
fail
########
########> Diff in /mnt/disk2/hudson-slave/workspace/GAP-dev/GAPCOPTS/64build/G\
APGMP/gmp/GAPTARGET/install/label/64bit/GAP-git-snapshot/tst/testinstall/timeo\
ut.tst, line 18:
# Input is:
CallWithTimeout(5000,spinFor,1,1);
# Expected output:
[ 1 ]
# But found:
fail
########
########> Diff in /mnt/disk2/hudson-slave/workspace/GAP-dev/GAPCOPTS/64build/G\
APGMP/gmp/GAPTARGET/install/label/64bit/GAP-git-snapshot/tst/testinstall/timeo\
ut.tst, line 20:
# Input is:
CallWithTimeoutList(5000,spinFor,[1,1]);
# Expected output:
[ 1 ]
# But found:
fail
########
########> Diff in /mnt/disk2/hudson-slave/workspace/GAP-dev/GAPCOPTS/64build/G\
APGMP/gmp/GAPTARGET/install/label/64bit/GAP-git-snapshot/tst/testinstall/timeo\
ut.tst, line 24:
# Input is:
CallWithTimeoutList(5000,spinFor,[1]);
# Expected output:
[  ]
# But found:
fail
########

olexandr-konovalov · 2015-11-20T23:49:29Z

@stevelinton could it be a race condition somewhere? The longer is the timeout, the more robust it is. For example, I have

gap> CallWithTimeout(50000,spinFor,1);
[  ]

each time, with the 1st argument being 10000 I can get fail time to time, and with the 1st argument being only 1000 I got [ ] one time out of twenty.

rbehrends · 2015-11-21T01:38:33Z

Most likely, it's just the timeout being hit. As I understand it, the value is given in microseconds, and 5000 µs = 5 ms can easily fail if the machine is just a bit busy (or if the OS doesn't support high resolution timers).

olexandr-konovalov · 2015-11-21T20:43:18Z

@rbehrends I think you're right... so I suggest to increase the timeout in the failing calls from 5000 to 50000 to make the test reproducible.

coveralls · 2017-09-05T07:57:13Z

Changes Unknown when pulling 3ac7ce3 on stevelinton:sl/alarm into ** on gap-system:master**.

stevelinton mentioned this pull request Nov 2, 2015

CallWithTimeLimit (was: Alarm timers) #315

Closed

stevelinton force-pushed the sl/alarm branch 2 times, most recently from b9e9a82 to ff39f83 Compare November 4, 2015 11:14

stevelinton changed the title ~~Add CALL_WITH_TIMEOUT and related functionality (WIP, do not merge)~~ Add CallWithTimeout and related. Nov 4, 2015

stevelinton force-pushed the sl/alarm branch from ff39f83 to f2c7d65 Compare November 4, 2015 13:05

stevelinton closed this Nov 4, 2015

stevelinton reopened this Nov 4, 2015

stevelinton mentioned this pull request Nov 5, 2015

Nested timeouts #342

Merged

stevelinton force-pushed the sl/alarm branch 3 times, most recently from ba07060 to b8fdb15 Compare November 13, 2015 11:47

stevelinton force-pushed the sl/alarm branch from b8fdb15 to 6a61af2 Compare November 13, 2015 16:43

stevelinton added 4 commits November 16, 2015 20:59

Introduce CallWithTimeout and CallWithTimeoutList and support functions.

9bd275a

Documentation and a test file

67a2800

Fix a number of minor bugs and improve robustness

730ec9e

Modify test file to hopefully please travis Add warning to documentation that a timeout may leave things inconsistent. Some robsutness improvements suggested by Reimer Check return value from timer_create

Fix for Cygwin as a far as Cygwin allows

3ac7ce3

Make alarms work with TakeInterrupt() This allows long, but cooperative kernel functions to be interrupted.

stevelinton force-pushed the sl/alarm branch from 6a61af2 to 3ac7ce3 Compare November 16, 2015 21:00

olexandr-konovalov pushed a commit that referenced this pull request Nov 17, 2015

Merge pull request #331 from stevelinton/sl/alarm

3ab25a4

Add CallWithTimeout and related.

olexandr-konovalov merged commit 3ab25a4 into gap-system:master Nov 17, 2015

olexandr-konovalov added this to the GAP 4.8.1 milestone Nov 20, 2015

rbehrends mentioned this pull request Nov 25, 2015

Hpcgap merge nov25 #375

Merged

ChrisJefferson mentioned this pull request Nov 27, 2015

HPC-GAP doesn't support CallWithTimeout #377

Closed

Add CallWithTimeout and related. #331

Add CallWithTimeout and related. #331

Conversation

stevelinton commented Nov 2, 2015

ChrisJefferson commented Nov 2, 2015

stevelinton commented Nov 2, 2015

stevelinton commented Nov 3, 2015

stevelinton commented Nov 3, 2015

stevelinton commented Nov 4, 2015

ChrisJefferson commented Nov 4, 2015

stevelinton commented Nov 4, 2015

ChrisJefferson commented Nov 4, 2015

stevelinton commented Nov 4, 2015

ChrisJefferson commented Nov 4, 2015

stevelinton commented Nov 4, 2015

ChrisJefferson commented Nov 4, 2015

stevelinton commented Nov 4, 2015

ChrisJefferson commented Nov 4, 2015

stevelinton commented Nov 4, 2015

Input is:

Expected output:

But found:

Input is:

Expected output:

But found:

Input is:

Expected output:

But found:

ChrisJefferson commented Nov 4, 2015

rbehrends commented Nov 5, 2015

stevelinton commented Nov 5, 2015

rbehrends commented Nov 5, 2015

stevelinton commented Nov 6, 2015

olexandr-konovalov commented Nov 6, 2015

stevelinton commented Nov 6, 2015

ChrisJefferson commented Nov 6, 2015

olexandr-konovalov commented Nov 6, 2015

ChrisJefferson commented Nov 6, 2015

olexandr-konovalov commented Nov 6, 2015

olexandr-konovalov commented Nov 6, 2015

stevelinton commented Nov 7, 2015

ChrisJefferson commented Nov 7, 2015

stevelinton commented Nov 13, 2015

ChrisJefferson commented Nov 13, 2015

olexandr-konovalov commented Nov 13, 2015

olexandr-konovalov commented Nov 17, 2015

olexandr-konovalov commented Nov 19, 2015

olexandr-konovalov commented Nov 20, 2015

rbehrends commented Nov 21, 2015

olexandr-konovalov commented Nov 21, 2015

coveralls commented Sep 5, 2017